KWAME NKRUMAH UNIVERSITY OF SCIENCE AND TECHNOLOGY
CANR
FACULTY OF RENEWABLE NATURAL RESOURCES

RICHARD DANKWAH

Tel: 0248127638

EXPLORATORY DATA ANALYSIS OF PHENOLOGICAL AND PRINCIPAL FLOWERING AND INFLORESCENCE EMERGENCE STAGE ON THEOBROMA CACAO.

Exploratory data analysis (EDA) is about detecting and describing patterns, trends, and relations in data, motivated by certain purposes of investigation. As something relevant is detected in data, new questions arise, causing specific parts to be viewed in more detail.

Data Preparation and Preprocessing

Data preprocessing is a data mining technique that involves transforming raw data into an understandable format. Real-world data is often incomplete, inconsistent, and/or lacking in certain behaviors or trends, and is likely to contain many errors. Data preprocessing is a proven method of resolving such issues.

    Step 1: Import the libraries
    Step 2: Import the data set
    Step 3: Check for missing values
    Step 4: Inspect the categorical values
    Step 5: Split the data set into training and test sets
    Step 6: Feature scaling

Importing Necessary Libraries

Loading Our Data Set Into a Pandas DataFrame

From the output above, we drop the ID column from our data set, since it has no bearing on our later analysis. We then display the DataFrame again and can see that the ID column has been dropped.

We inspect the number of rows and columns of our DataFrame to get an idea of how much data is available for our analysis. We find that we have one hundred and fifty (150) rows and five (5) columns, as expected from our estimate for the data collection period.
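The drop and the shape check described above can be sketched as follows. This is a minimal illustration, not the original notebook code: the column names (`Id`, `SepalLengthCm`, `AreaSite`, and so on) and the three rows are assumptions made up for the example.

```python
import pandas as pd

# Hypothetical rows and column names, for illustration only.
df = pd.DataFrame({
    "Id": [1, 2, 3],
    "SepalLengthCm": [5.1, 6.4, 6.3],
    "SepalWidthCm": [3.5, 3.2, 3.3],
    "PetalLengthCm": [1.4, 4.5, 6.0],
    "PetalWidthCm": [0.2, 1.5, 2.5],
    "AreaSite": ["Bekwai", "Offinso", "Mampong"],
})

# Drop the identifier column, which carries no information for the analysis.
df = df.drop(columns=["Id"])

# Confirm the column is gone and check (rows, columns).
print(df.columns.tolist())
print(df.shape)
```

On the full data set, the same `df.shape` call would be expected to report the 150 rows and 5 columns mentioned above.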

• Inspecting our target variable, that is, our columns

To get a descriptive understanding of our data, we run a simple descriptive analysis to understand important patterns such as:

. Mean, which is the average of each numerical variable in our data set.

. Standard deviation, which describes the variability in our data set. It tells us, roughly, how far each value lies from the mean on average. A higher standard deviation tells us that the values are more spread out around the mean.

The standard deviation of sepal length is 0.83 cm, so we can identify which sepal lengths lie within one standard deviation (0.83 cm) of the mean. Sepal width has a small standard deviation of 0.43 cm, petal length has a standard deviation of 1.77 cm, and petal width has a small standard deviation of 0.76 cm. The min is the minimum value and the max is the maximum value in our data set. The 25%, 50% and 75% rows are the quartiles: each gives the value below which that percentage of our data falls.
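A descriptive summary of this kind can be produced with pandas `DataFrame.describe()`. The sketch below assumes hypothetical column names and uses six made-up measurements, so its numbers are not the ones reported above.

```python
import pandas as pd

# Made-up measurements in cm, for illustration only.
df = pd.DataFrame({
    "SepalLengthCm": [5.1, 4.9, 6.4, 6.9, 6.3, 5.8],
    "PetalLengthCm": [1.4, 1.5, 4.5, 4.9, 6.0, 5.1],
})

# describe() returns count, mean, std, min, the quartiles and max per column.
summary = df.describe()
print(summary)

# Pick out individual statistics by row label.
print(summary.loc[["mean", "std", "50%"]])
```

The `std` row is the sample standard deviation, and the `25%`, `50%` and `75%` rows are the quartiles discussed above.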

We count the number of observations for each class of our target (response) variable. The classes are evenly distributed and balanced: the number of data points for every class is fifty (50), that is, fifty responses each from Bekwai, Offinso and Mampong.
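The class balance check can be sketched with `value_counts()`. The label series below is rebuilt from the counts stated above (fifty per site); the series name `AreaSite` is an assumption for the example.

```python
import pandas as pd

# Rebuild a label column with fifty observations per study site.
labels = pd.Series(
    ["Bekwai"] * 50 + ["Offinso"] * 50 + ["Mampong"] * 50,
    name="AreaSite",
)

# Count how many observations fall in each class.
counts = labels.value_counts()
print(counts)

# The data set is balanced when every class has the same count.
is_balanced = counts.nunique() == 1
print(is_balanced)
```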

In the above figure, we plot petal length on the x-axis and petal width on the y-axis. Every data point is drawn on the plot, and it is called a 2D plot because it uses two features, one on the x-axis and one on the y-axis.

Using the petal length and petal width features, we can easily distinguish the data points from the three study sites. Data points from Mampong and Offinso show some similarity, that is, a few of them overlap, but they can still be separated fairly easily in this 2D scatter plot. Petal width and petal length are clearly associated: the larger the petal length, the larger the petal width, as shown above.

From the above visualization, we can see that the data points from Bekwai are spread closer to each other than those from Offinso and Mampong. The most widely spread data points in our data set are those from Mampong.

Plotly Plot Showing Data Points for Petal Length and Width

Using the sepal length and sepal width features, we can distinguish the Bekwai data points from those of Offinso and Mampong. Separating Mampong and Offinso is a little harder because the data points for these areas overlap. Moreover, we can see that the Bekwai data have a higher sepal width and a lower sepal length than the data points from Offinso and Mampong. The sepal lengths across the study areas show some similarities.

Plotly Plot for Sepal Length and Width Showing Data Points

Sample Mean, Sample Standard Deviation, Sample Variance, Min, Max and Count of Target Variable

The table above presents a descriptive analysis for each sample (area site): the count of our feature variables, the maximum data point, the minimum value, the average (sample mean x̄ = ( Σ xi ) / n) and the sample standard deviation. The standard deviation measures the amount of variation or dispersion of a set of numeric values. The variance measures how far the individual (numeric) values in a data set are from the mean, and is often used to quantify spread or dispersion. For Bekwai, the sample variance is 0.12 cm² about a sample mean of 5.01 cm.
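A per-site table like the one described above can be sketched with a pandas `groupby` followed by `agg`. The column names and the six measurements are hypothetical; only the technique is meant to carry over.

```python
import pandas as pd

# Made-up sepal lengths in cm for two sites, for illustration only.
df = pd.DataFrame({
    "AreaSite": ["Bekwai", "Bekwai", "Bekwai", "Mampong", "Mampong", "Mampong"],
    "SepalLengthCm": [5.0, 4.8, 5.2, 6.3, 6.5, 6.7],
})

# One row per site: count, sample mean, sample std, sample variance, min, max.
per_site = df.groupby("AreaSite")["SepalLengthCm"].agg(
    ["count", "mean", "std", "var", "min", "max"]
)
print(per_site)
```

pandas computes `std` and `var` with the sample (n - 1) divisor by default, which matches the sample statistics named in the table heading.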

The sample standard deviation, denoted s, is the square root of the sample variance (s²). A high variance tells us that the values in our data set are far from their mean, so the data have a high level of variability. A low variance, on the other hand, tells us that the values are quite close to the mean, so the data have a low level of variability. Likewise, the population standard deviation σ is the square root of the population variance σ².
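As a worked example of these definitions, the sketch below computes the sample variance and standard deviation by hand on five made-up values and cross-checks them against NumPy. The population variance, which divides by n instead of n - 1, is included for comparison.

```python
import numpy as np

# Made-up measurements in cm, for illustration only.
x = np.array([5.0, 4.8, 5.2, 5.4, 4.6])
n = x.size

# Sample mean: sum of the values divided by n.
mean = x.sum() / n

# Sample variance: squared deviations from the mean, divided by n - 1.
sample_var = ((x - mean) ** 2).sum() / (n - 1)

# Sample standard deviation: square root of the sample variance.
sample_std = np.sqrt(sample_var)

# Population variance: the same sum, divided by n.
population_var = ((x - mean) ** 2).sum() / n

print(mean, sample_var, sample_std, population_var)

# NumPy gives the same results through the ddof argument.
print(np.var(x, ddof=1), np.std(x, ddof=1), np.var(x, ddof=0))
```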

Population Variance for Sepal Length, Sepal Width, Petal Length and Petal Width

A high variance relative to the mean tells us that the values in our data set are far from the mean, while a low variance tells us that the values are quite close to the mean. Most of our feature variables have a low variance relative to their mean, so our data have low levels of variability. The exception is petal length, whose variance of 3.09 cm² is close in magnitude to its mean of 3.75 cm.

Correlation PairPlot

The pair plot shows the pairwise relationships between our feature variables for each class of our target variable (study area). We can see that the cocoa flower data points from Bekwai differ markedly in their characteristics from those of the other study areas. The Bekwai flowers have smaller petal width and length, while their sepal width is high and their sepal length is low. Similar conclusions can be drawn for the other study areas: Offinso flowers usually have average dimensions, whether sepal or petal width and length, while Mampong flowers have high petal width and length, a small sepal width and a larger sepal length.

Correlation coefficients are used to measure the strength of the relationship between two variables. Pearson correlation is the one most commonly used in statistics. It measures the strength and direction of a linear relationship between two variables. Values always range between -1 (perfect negative relationship) and +1 (perfect positive relationship). Values at or close to zero imply a weak or no linear relationship. As a rule of thumb in this analysis, coefficients between -0.8 and +0.8 are not considered strong.

General Correlation Formula

Covariance is a measure of how two variables change together, but its magnitude is unbounded, so it is difficult to interpret. By dividing the covariance by the product of the two standard deviations, one obtains a normalized version of the statistic. This is the correlation coefficient: r = cov(X, Y) / (σX σY).

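The sign of the covariance indicates the direction of the relationship: it is positive when the two variables increase together and negative when one decreases as the other increases. Because its magnitude depends on the units of the variables, a large covariance does not by itself indicate a strong relationship.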

Correlation, on the other hand, measures both the strength and direction of the linear relationship between two variables.
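The step from covariance to correlation can be sketched with NumPy. The two arrays are made-up petal measurements in cm; the example also rescales them to millimetres to show that the covariance changes with the units while the correlation does not.

```python
import numpy as np

# Made-up petal measurements in cm, for illustration only.
petal_length = np.array([1.4, 1.5, 4.5, 4.9, 6.0, 5.1])
petal_width = np.array([0.2, 0.2, 1.5, 1.5, 2.5, 1.9])

# Sample covariance: off-diagonal entry of the 2x2 covariance matrix.
cov_xy = np.cov(petal_length, petal_width, ddof=1)[0, 1]

# Correlation: covariance divided by the product of the standard deviations.
r = cov_xy / (petal_length.std(ddof=1) * petal_width.std(ddof=1))
print(cov_xy, r)

# Converting cm to mm multiplies the covariance by 100 but leaves r unchanged.
cov_mm = np.cov(petal_length * 10, petal_width * 10, ddof=1)[0, 1]
r_mm = np.corrcoef(petal_length * 10, petal_width * 10)[0, 1]
print(cov_mm, r_mm)
```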

Pearson Correlation

Correlation Coefficient Matrix

The correlation matrix visualizes the correlation between our independent variables (features). The deeper the colour of a cell in the matrix, the higher the correlation coefficient.
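The matrix behind such a heat map can be computed with pandas `DataFrame.corr()`. The sketch below assumes hypothetical column names and six made-up rows; the plotting step is left out.

```python
import pandas as pd

# Made-up measurements in cm, for illustration only.
df = pd.DataFrame({
    "SepalLengthCm": [5.1, 4.9, 6.4, 6.9, 6.3, 5.8],
    "SepalWidthCm": [3.5, 3.0, 3.2, 3.1, 3.3, 2.7],
    "PetalLengthCm": [1.4, 1.5, 4.5, 4.9, 6.0, 5.1],
    "PetalWidthCm": [0.2, 0.2, 1.5, 1.5, 2.5, 1.9],
})

# Pairwise Pearson correlation coefficients between all feature columns.
corr = df.corr(method="pearson")
print(corr.round(2))
```

The result is symmetric with ones on the diagonal, since every feature is perfectly correlated with itself.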

Histograms

The histograms help us see the distribution of the data in the various columns of our data set of cocoa flower measurements.

> The highest frequency of sepal width of our cocoa flowers is between 3.0 and 3.5, at around 70.

> The highest frequency of sepal length of our cocoa flowers is between 5.5 and 6.0, at around 35.

> The highest frequency of petal width of our cocoa flowers is between 0 and 0.5, at around 50.

> The highest frequency of petal length of our cocoa flowers is between 0 and 0.5, at around 50.

Histogram of Data Points of Sepal Length and Petal Width for Our Area of Study

Univariate Analysis and Histogram

Univariate analysis is the most basic form of statistical data analysis. It is used when the data contain only one variable and the analysis does not deal with cause-and-effect relationships. For instance, in a survey of cocoa flower measurements, the researcher may want to summarize sepal length, sepal width, petal length or petal width one at a time. In our case, the data simply reflect a single variable and its values, as in the tables and charts in this report. The key objective of univariate analysis is to describe the data and find patterns within it. This is done by looking at the mean, median, mode, dispersion, variance, range and standard deviation.

Probability Density Function (PDF) vs Cumulative Distribution Function (CDF)

We have four (4) features in our data set: sepal length, sepal width, petal length and petal width. We visualize all four features to see the similarities and differences among them, without making inferences or computing correlations between them. The purpose is to describe the characteristics of our features and find patterns in them for future analysis.

Histogram and Probability Density Function (PDF)

The histograms give a graphical representation of the distribution of our numerical data, that is, the cocoa flower measurements from the various study areas. Each histogram is an estimate of the probability distribution of a continuous quantitative variable in our data set (cocoa flower measurements in cm). In the first step, our Python code "bins" the range of values, that is, it divides the entire range of values into a series of intervals and then counts how many values fall into each interval.

In constructing the histograms, the x-axis is the feature (petal length, petal width, sepal length or sepal width) and the y-axis is the count of points that fall in each range. From this plot we can observe how many points lie in a particular region. In short, a histogram uses the y-axis to show how many points exist for each interval of values on the x-axis.
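The binning step can be sketched with `numpy.histogram`, which returns the count per interval together with the interval edges. The ten petal lengths and the choice of six bins between 1 cm and 7 cm are assumptions made for the example.

```python
import numpy as np

# Made-up petal lengths in cm, for illustration only.
petal_length = np.array([1.3, 1.4, 1.5, 1.7, 4.0, 4.5, 4.7, 5.1, 5.9, 6.1])

# Divide the range 1 cm to 7 cm into six equal intervals and count the values.
counts, edges = np.histogram(petal_length, bins=6, range=(1.0, 7.0))

# Print each interval with the number of points that fall inside it.
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.1f} to {right:.1f} cm: {count}")
```

The counts are the bar heights on the y-axis, and the edges mark the intervals on the x-axis.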

For the petal width of our cocoa flowers, the samples from Bekwai are the most easily separable, with petal widths between 0.01 cm and 0.5 cm. Offinso and Mampong, meanwhile, have similar petal widths, which shows in the overlapping area of our visualization. The plot shows that most of the petal widths for these areas are between 1.0 cm and 1.10 cm for Offinso and between 1.4 cm and 2.5 cm for Mampong. The samples from Bekwai therefore have a petal width that is distinct from Offinso and Mampong.

With petal length, the samples from Bekwai are again distinct from those of Offinso and Mampong. Most of the Bekwai data points have a petal length between 1 cm and 2 cm. Offinso has most of its data points between 3 cm and 5 cm, and Mampong between 4 cm and 7 cm.

The recorded sepal length measurements from our study areas are similar, and they overlap with each other to the point of being inseparable.

From the above observations, we can say that the sepal lengths of our cocoa flowers are similar to each other, especially for Offinso and Mampong, which is why these two study areas overlap in the plot above.

Cumulative Distribution Function (CDF)

Plotting CDF Vs PDF

In the above plot, the green line represents the cumulative distribution function (CDF) and the blue line represents the probability density function (PDF). The x-axis represents the petal length of our cocoa flowers and the y-axis represents the percentage (proportion) of samples.

From our observation, the peak of the PDF shows that the largest number of samples collected from Bekwai have a petal length between 1.5 cm and 1.6 cm. The green CDF curve shows that about 95% of the Bekwai samples have a petal length less than or equal to 1.7 cm. Looking at the curve again, it reaches 100% (1.0) at a petal length of 1.9 cm, which means that none of the Bekwai samples has a petal length above 1.9 cm.
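How the two curves relate can be sketched with NumPy: the fraction of samples per bin acts as a discrete stand-in for the PDF, and its running sum gives the CDF. The ten petal lengths and the bin edges below are made up for the example, so the percentages differ from those read off the plot above.

```python
import numpy as np

# Made-up petal lengths in cm, for illustration only.
petal_length = np.array([1.3, 1.4, 1.4, 1.5, 1.5, 1.5, 1.6, 1.6, 1.7, 1.9])

# Count the values in four bins between 1.0 cm and 2.0 cm.
edges = np.array([1.0, 1.25, 1.5, 1.75, 2.0])
counts, _ = np.histogram(petal_length, bins=edges)

# Fraction of samples in each bin (discrete stand-in for the PDF).
pdf = counts / counts.sum()

# Running total of those fractions: share of samples at or below each bin's upper edge.
cdf = np.cumsum(pdf)

for upper, p, c in zip(edges[1:], pdf, cdf):
    print(f"up to {upper:.2f} cm: bin share {p:.2f}, cumulative {c:.2f}")
```

The CDF never decreases and ends at 1.0, which is why the green curve flattens out at 100% once the largest value has been passed.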

Univariate Analysis Using a Box-and-Whisker Plot

A box-and-whisker plot (sometimes called a boxplot) is a graph that presents information from a five-number summary. It does not show a distribution in as much detail as a stem-and-leaf plot or histogram does, but it is especially useful for indicating whether a distribution is skewed and whether there are potential unusual observations (outliers) in the data set. A box plot with whiskers is another way of visualising the 1-D scatter plot.

From the above box plot, we can say that samples from Bekwai have smaller measurements for every feature except sepal width. The height of the box shows the spread of our measurements or data points. The upper edge of the box is the upper quartile, the 75th percentile (3rd quartile), at or below which 75% of our data points fall. The line in the box is the median (50th percentile). The lower edge of the box is the lower quartile, the 25th percentile (1st quartile). The whiskers extend from the box to the upper and lower extremes.

Our data visualization tells us how our data points are distributed: the wider the whiskers and the box, the more widely the data are spread. For example, the sample data from Bekwai have a much lower petal length, while Mampong has a wider spread of petal lengths.

> At Mampong, petal length runs from an upper quartile (75th percentile) of about 5.9 cm to an upper extreme of about 7 cm.
> At Offinso, petal length runs from an upper quartile (75th percentile) of about 4.6 cm to an upper extreme of about 5.1 cm.
> At Bekwai, petal length runs from an upper quartile (75th percentile) of about 1.5 cm to an upper extreme of about 1.8 cm.
> The data points that fall outside the whiskers of the box plots are outliers, that is, unusual data points that lie below or above the general range of the data in our data set.
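The quantities a box plot draws can be sketched numerically. The example below uses ten made-up values and assumes the common convention that points more than 1.5 times the interquartile range beyond the box are flagged as outliers; the plots above may have been drawn with a different rule.

```python
import numpy as np

# Made-up petal lengths in cm, for illustration only.
x = np.array([1.0, 1.3, 1.4, 1.4, 1.5, 1.5, 1.6, 1.7, 1.9, 3.0])

# Quartiles: lower edge of the box, median line, upper edge of the box.
q1, median, q3 = np.percentile(x, [25, 50, 75])

# Interquartile range: the height of the box.
iqr = q3 - q1

# Fences under the assumed 1.5 * IQR convention.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points beyond the fences are treated as outliers.
outliers = x[(x < lower_fence) | (x > upper_fence)]

print(x.min(), q1, median, q3, x.max())
print(outliers)
```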

In general, we can say that, except for sepal width, Bekwai has the smallest flower measurements of all three study areas.

Univariate Analysis using Violin plots

A violin plot is a method of plotting numeric data. It is similar to a box plot, with the addition of a rotated kernel density plot on each side, so it also shows the probability density of the data at different values, usually smoothed by a kernel density estimator.

In the violin plot, the wide areas of the plot show where most of the data points are concentrated, while the narrow areas show where fewer data points lie. The wider the plot at a given value, the higher the density of data points there.
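The smoothed outline of a violin comes from a kernel density estimate. The sketch below implements a simple Gaussian kernel density estimator with NumPy; the helper name `gaussian_kde`, the six sample values and the bandwidth of 0.1 are assumptions for illustration, and plotting libraries usually choose the bandwidth automatically.

```python
import numpy as np

def gaussian_kde(samples, grid, bandwidth):
    """Average a Gaussian bump centred on each sample, evaluated on the grid."""
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

# Made-up petal lengths in cm, for illustration only.
samples = np.array([1.3, 1.4, 1.5, 1.5, 1.6, 1.7])

# Evaluate the density on an evenly spaced grid of petal lengths.
grid = np.linspace(0.5, 2.5, 201)
density = gaussian_kde(samples, grid, bandwidth=0.1)

# The violin is widest where the density is highest.
widest_at = grid[density.argmax()]
print(widest_at)
```

Mirroring the `density` curve on both sides of a vertical axis gives the violin shape.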

Geographical Information Systems (GIS) Using Python

Geographical Information Of Study Area

The measurement on the map shows that the area enclosed by Bekwai, Mampong and Offinso covers about 195,435.10 acres of land, with a perimeter of about 155.03 kilometres. Linear measurements show that the distance from Bekwai to Offinso is about 50.97 kilometres, from Offinso to Mampong about 32.81 kilometres, and from Mampong to Bekwai about 70.98 kilometres.
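Straight-line distances of this kind can be approximated from latitude and longitude with the haversine formula. The sketch below is not the method used to produce the figures above: the function name `haversine_km` is introduced for the example, and the coordinates are placeholders rather than the surveyed positions of the study sites.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Placeholder coordinates (decimal degrees), NOT the real site positions.
site_a = (6.0, -1.5)
site_b = (7.0, -1.5)

distance = haversine_km(*site_a, *site_b)
print(round(distance, 2))
```

Summing the three pairwise distances between the sites gives the perimeter of the triangle they form.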

Overall, if we site the factory in a district other than the one where the raw material is located, we are likely to incur additional transportation costs.